© USO—iStock/Getty Images
© USO—iStock/Getty Images

1. Introduction

1.1 Background and Motivation

At first, we wanted to analyse the impacts of the recent global pandemic on individuals within the context of a variety of mental health conditions, namely anxiety and depression, in a pre- and post-COVID framework. However, we decided to drop the idea due to a lack of data available regarding the topic, which wouldn’t have allowed us to conduct a thorough analysis. As we were searching for new topic ideas, we found a dataset that concerns shark attacks around the world. A group member has then told us about an incident that occurred during his last holidays in Egypt, involving a shark attack at a paradisiac beach. We discussed thereafter the presence of these predators in many different regions in the world including Florida, Hawaii, and Australia just to name a few. For this reason, we decided to dive deeper into the topic of shark attacks and explore potential influencing factors. After a few days of research and information gathering, we found that a wide range of factors are capable of influencing such incidents, both directly and indirectly. We all enjoy going on holidays and enjoy beaches while drinking a cocktail, and when we feel that temperature increases, we go in the water to refresh, but depending on the location we are, this could be unsafe.

Another motivation behind this project is our interest in the preservation of marine fauna. Very often human beings tend to primarily blame the animals for their fierce behavior (just look at the example of the attack that occurred in Hurghada during summer of 2023: after the death of the victim, the shark was killed!) but we often forget that the water it is their natural habitat, and that we are the guests. This point motivates us even more to study the phenomenon of attacks and to disclose how human activity causes shark attacks.

1.2 Project Objective

The project’s focus is to provide an understanding of the complex dynamics of shark attacks globally, by conducting an in-depth analysis on the different factors that influence the phenomenon. In addressing the topic at hand, we explore both direct and indirect influencing factors.

At the end of our work we want to learn how it is possible to reduce accidents caused by sharks. On the one hand, this is crucial to guarantee the safety of people who take pleasure in coastal waters and engage in water-related activities. On the other hand, the conservation of aquatic animals might benefit from this. Making efforts to preserve these species is essential for sustaining the health of marine ecosystems in a time when the extinction rate of many animal species is rising. Among the factors that might have an impact on the trend of shark attacks we have climatic variations. Ocean temperatures and currents are capable of affecting migration patterns and habitat preferences, which further complicates the dynamics of shark attacks. By providing a step-by-step analysis on climate-related aspects, we aim to provide empirical evidence on the impact of environmental changes on shark movements and interactions with humans.

This project’s goal is to contribute to the understanding of shark-related incidents and thereby to provide possible strategies such as policymaking, which would help mitigate such occurrences, ultimately safeguarding human life and protecting sharks. In essence, we aim to advance research on sustainable coexistence with the marine environment through actionable insights and intend to contribute to the broader efforts of marine conservation.

1.4 Research Questions

  1. What are the primary factors influencing the occurrence of shark attacks?
  2. Is it true that shark attacks have been increasing with climate change?
  3. What are the main factors leading to the fatality of a shark attack?

2. Data Sets

In the second section of our work, we will present the datasets that are necessary to answer our research questions. These datasets serve as the foundation for our investigation and provide the data required to draw insightful and important findings. The purpose of this part is to provide a clear overview of the data landscape and make clear to the reader what are the information to be found in each web-based data, and how do they interrelate.

2.1 Shark Attacks Data Set

Our first dataset contains the information in over 6000 shark attacks that have occurred worldwide over the course of the past centuries. The dataset was obtained from the Shark Research Institute’s (SRI) website, which carries out and funds rigorous, peer-reviewed shark field research and instructs people using scientific data.

Even though the dataset is owned by the SRI, we have downloaded it through the following plateform: https://www.kaggle.com/datasets/teajay/global-shark-attacks.


# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable

Variables <- c("Case Number", "Date", "Year", "Type", "Country", "Area", "Location", "Activity", "Name", "Sex", "Age", "Injury", "Fatal", "Time", "Species", "Investigator.or.Source", "pdf", "href.formula", "Case Number 2")
Description <- c("Case Number of accident", "Date of the accident", "Year of accident", "Whether the victim provoked the shark or not", "Country of accident", "Region", "Location", "Activity the victim was doing before accident","Name of the victim", "Sex of the victim", "Age of the victim","Injury provoked by the accident", "Whether the accident led to death or not (Binary variable)", "Time of accident", "Shark Species", "Investigator of the accident", "Pdf information of the accident", "Pdf link of the accident", "Case number (same aas the first one)")


Table1 <- data.frame(Variable = Variables, Description = Description)

# The knitr::kable() allows to visualize the table
knitr::kable(Table1)
Variable Description
Case Number Case Number of accident
Date Date of the accident
Year Year of accident
Type Whether the victim provoked the shark or not
Country Country of accident
Area Region
Location Location
Activity Activity the victim was doing before accident
Name Name of the victim
Sex Sex of the victim
Age Age of the victim
Injury Injury provoked by the accident
Fatal Whether the accident led to death or not (Binary variable)
Time Time of accident
Species Shark Species
Investigator.or.Source Investigator of the accident
pdf Pdf information of the accident
href.formula Pdf link of the accident
Case Number 2 Case number (same aas the first one)

2.2 Temperatures Data Set

Our second dataset contains the average temperature change for each country (annually, monthly and seasonally) from 1961 to 2022. It contains 9,657 observations.

The dataset was taken from the website of the Food and Agriculture Organization of United Nations: https://www.fao.org/faostat/en/#data/ET


# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable

Variables <- c("Area Code", "Area", "Months ", "Month code ", "Element Code ", "Element", "Unit", "Years")
Description <- c("Code for each country", "Country for which the data has been collected", "Month during which the data on temperature has been collected", "Code of each month ", "Binary variable (7271: Temperature change; 6078: Standard Deviation)", "What is the data measuring (either temperature change or standard deviation of data collection)", "Unit of measure of temperature ", "Year of observations" )


Table2 <- data.frame(Variable = Variables, Description = Description)

# The knitr::kable() allows to visualize the table
knitr::kable(Table2)
Variable Description
Area Code Code for each country
Area Country for which the data has been collected
Months Month during which the data on temperature has been collected
Month code Code of each month
Element Code Binary variable (7271: Temperature change; 6078: Standard Deviation)
Element What is the data measuring (either temperature change or standard deviation of data collection)
Unit Unit of measure of temperature
Years Year of observations

2.3 Sea Level Data Set

Our third dataset contains information on the global increase of sea level from 1993 to 2021. The following acronims are important to be noted: GMLS:Global mean sea level GIA: Global Isostatic Adjustment, which is the ongoing movement of land once burdened by ice-age glaciers (National Ocean Service, n.d.)

The dataset was taken from the website of the NASA: https://climate.nasa.gov/vital-signs/sea-level/


# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable

Variables <- c("Year", "TotalWeightedObservations ", "GMSL_noGIA ", "StdDevGMSL_noGIA ", "SmoothedGSML_noGIA ", "GMSL_GIA", "StdDevGMSL_GIA", "SmoothedGSML_GIA", "SmoothedGSML_GIA_sigremoved")
Description <- c("Year of the observation", "Total Weighted Observations of Sea Level", "Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (without considering the GIA)", "Standard Deviation of global mean sea level variation estimate in millimeters (without considering the GIA)", "Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (without considering the GIA)", "Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (considering the GIA)", "Standard Deviation of global mean sea level variation estimate in millimeters (considering the GIA)", "Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (considering the GIA)", "Smoothed (60-day Gaussian type filter) Global Mean Sea Level variation in millimeters (considering the GIA) ; annual and semi-annual signal removed ")


Table3 <- data.frame(Variable = Variables, Description = Description)

# The knitr::kable() allows to visualize the table
knitr::kable(Table3)
Variable Description
Year Year of the observation
TotalWeightedObservations Total Weighted Observations of Sea Level
GMSL_noGIA Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (without considering the GIA)
StdDevGMSL_noGIA Standard Deviation of global mean sea level variation estimate in millimeters (without considering the GIA)
SmoothedGSML_noGIA Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (without considering the GIA)
GMSL_GIA Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (considering the GIA)
StdDevGMSL_GIA Standard Deviation of global mean sea level variation estimate in millimeters (considering the GIA)
SmoothedGSML_GIA Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (considering the GIA)
SmoothedGSML_GIA_sigremoved Smoothed (60-day Gaussian type filter) Global Mean Sea Level variation in millimeters (considering the GIA) ; annual and semi-annual signal removed

2.4 Greenhouse Gas Emissions Data Set

Our fourth dataset relates to GHG emissions, one of the main concerns linked to global warming. The dataset contains 31.315 observations on annual GHG emissions per country and per continent, but we will focus on the first case only.

The dataset was taken from the website Our World in Data: https://ourworldindata.org/annual-co2-emissions


# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable

Variables <- c("Entity", "Code", "Year", "Annual COâ‚‚ emissions")
Description <- c("Country examinated", "ISO Code", "Year of the observation", "GHG emissions per year")


Table4 <- data.frame(Variable = Variables, Description = Description)

# The knitr::kable() allows to visualize the table
knitr::kable(Table4)
Variable Description
Entity Country examinated
Code ISO Code
Year Year of the observation
Annual COâ‚‚ emissions GHG emissions per year

3. Data Wrangling

The first step of our work is data cleaning. While collecting data for a study, it might happen to have some variables with missing information, worthless data or spelling mistakes In order to run a regression, it is fundamental that each column within a data set has no missing values and that information is written following the same format. This procedure takes the name of data cleaning, and it will cover the whole subsection.

Please be aware that at the beginning of each subsection we only provide general instructions of our cleaning. Through the code, you will get additional in-depth explanations of each step we took.

3.1 Attacks Cleaning

The first dataset we will examine is the one pertaining to shark attacks. It was quite full of misspellings and missing values, which required a lot of time and a very dense code to be able to clean it.

For our work, we have decided to only study the phenomenon of shark attacks from 1970 forward. In fact, it is possible to observe in the raw dataset that, before to this date, observations had more erroneous information. This leads one to believe that, more likely than not, the information we needed is more accurate if we do not travel too far back in time.

While cleaning the column of Ages, we realized that NA were too much and we could not delete them all, because that would have meant losing 29,5% of our data. Therefore, we have decided to work in the following way: for each country that presents LESS NA than the total observation, we compute an histogram (Age against frequency) of non-missing data. If the distribution is normal, we substitute the NA with the mean. If it is left skewed, we substitute the NAs with the first quantile, if it is right skewed, we substitute it with the third quantile.

Here is an example with USA and Bahamas. For USA, the distribution of the non-missing victims’ ages is skewed on the left. Therefore we have decided to substitute all the missing values with the first quantile. For Bahamas, it is evident that data follow a normal distribution, therefore we replace NAs with the mean.

#EGYPT#

ages_in_egy <- attacks %>%
  filter(Country == "EGYPT") %>%
  select(Age)

ages_vector <- ages_in_egy$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_egy = mean(ages_without_na)

attacks$Age[attacks$Country == "EGYPT" & is.na(attacks$Age)] <- mean_egy


#ITALY#

# Extract all ages from individuals who come from Italy
ages_in_italy <- attacks %>%
  filter(Country == "ITALY") %>%
  select(Age)

# Convert the extracted ages to a vector
ages_vector <- ages_in_italy$Age


# Display or use the 'ages_vector' containing ages from individuals in Italy

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_age = mean(ages_without_na)

# Replace all the NA on Italy with the mean of people attacked in Italy.

attacks$Age[attacks$Country == "ITALY" & is.na(attacks$Age)] <- 37.375

#AUSTRALIA#

ages_in_aus <- attacks %>%
  filter(Country == "AUSTRALIA") %>%
  select(Age)

ages_vector <- ages_in_aus$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_aus = mean(ages_without_na)

attacks$Age[attacks$Country == "AUSTRALIA" & is.na(attacks$Age)] <- mean_aus



#SOUTH AFRICA#

ages_in_SA <- attacks %>%
  filter(Country == "SOUTH AFRICA") %>%
  select(Age)

ages_vector <- ages_in_SA$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_SA = mean(ages_without_na)

attacks$Age[attacks$Country == "SOUTH AFRICA" & is.na(attacks$Age)] <- mean_SA


#BRAZIL#

ages_in_BRA <- attacks %>%
  filter(Country == "BRAZIL") %>%
  select(Age)

ages_vector <- ages_in_BRA$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_BRA = mean(ages_without_na)

attacks$Age[attacks$Country == "BRAZIL" & is.na(attacks$Age)] <- mean_BRA


#FIJI#

ages <- attacks %>%
  filter(Country == "FIJI") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "FIJI" & is.na(attacks$Age)] <- mean

#FRENCH POLYNESIA#

ages <- attacks %>%
  filter(Country == "FRENCH POLYNESIA") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "FRENCH POLYNESIA" & is.na(attacks$Age)] <- mean

#HONG KONG#

ages <- attacks %>%
  filter(Country == "HONG KONG") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "HONG KONG" & is.na(attacks$Age)] <- mean

#JAPAN#

ages <- attacks %>%
  filter(Country == "JAPAN") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "JAPAN" & is.na(attacks$Age)] <- mean

#MEXICO#

ages <- attacks %>%
  filter(Country == "MEXICO") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "MEXICO" & is.na(attacks$Age)] <- mean


#MOZAMBIQUE#

ages <- attacks %>%
  filter(Country == "MOZAMBIQUE") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "MOZAMBIQUE" & is.na(attacks$Age)] <- mean


#NEW ZELAND#

ages <- attacks %>%
  filter(Country == "NEW ZEALAND") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "NEW ZEALAND" & is.na(attacks$Age)] <- mean

#PHILIPPINES#

ages <- attacks %>%
  filter(Country == "PHILIPPINES") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "PHILIPPINES" & is.na(attacks$Age)] <- mean

#REUNION#

ages <- attacks %>%
  filter(Country == "REUNION") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "REUNION" & is.na(attacks$Age)] <- mean

#NEW CALEDONIA#

ages <- attacks %>%
  filter(Country == "NEW CALEDONIA") %>%
  select(Age)

ages_vector <- ages$Age

ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)

attacks$Age[attacks$Country == "NEW CALEDONIA" & is.na(attacks$Age)] <- mean


#We fixed those countries that had lots of NA. Now, for the remaining NA we decided not to do anything for a specific reason. Not only we have not cound the real information on the internet, but we also believe that doing the mean for those remaining situations was useless, because these are the cases when NA are either the same amount of total shark attacks (see Antigua with 1 shark attack and 1 NA), or NA are more than half of the total attacks (see Malaysia with 4 total attacks but 3 of them are NA).


count_na_by_country <- attacks %>%
  group_by(Country) %>%
  summarise(NA_count = sum(is.na(Age)))


attacks <- subset(attacks, !is.na(Age))

#___________________________________________________________________________________________________________________


#NOW WORK ON TIME

#sometimes hours are written in a different format compared to most of the others, which follow the format
#13h30. Since they are not a lot we just replaced them manually

attacks<- mutate(attacks, Time= ifelse(Time=="Possibly same incident as 2000.08.21", "", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="14h00  -15h00", "14h30", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="14h30 / 15h30", "15h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="09h30 / 10h00", "9h45", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="10h45-11h15", "11h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="Sometime between 06h00 & 08hoo", "7h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="18h15-18h30", "18h20", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="09h00 - 09h30", "9h15", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="10h00 -- 11h00", "10h30", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="11h115", "11h15", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "Between 05h00 and 08h00", "6h30", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "17h00 or 17h40", "17h20", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "13h345", "13h45", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "<a0> ", "", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "09h00 -10h00", "9h30", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "2 hours after Opperman", "", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "11h00 / 11h30", "11h15", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "Between 06h00 & 07h20", "6h40", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "30 minutes after 1992.07.08.a", "", Time))



#now, our objective is to classify hours in parts of the day. indeed, it is useless to keep hours as they are
#for a regression because we want times like 7h30 and 7h00 OR 15h00 and 16h00 to be read as the same thing.
# some of the rows already had the part of the day in it, but in order to work easily with all the column, we
#replaced "morning" with 8h00 etc. in this way, we're able to remove all the strings that are useless. finally, 
#void rows have been replaced by an NA and hours, which were there as CHAR have been replaced by numeric

attacks$Time <- str_replace_all(attacks$Time, "\\bmorning\\b", "8h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bafternoon\\b", "15h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bevening\\b", "20h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bnight\\b", "23h00")

attacks$Time <- gsub("[^[:digit:]]", "", attacks$Time)
attacks$Time = na_if(attacks$Time, "")
attacks$Time <- as.numeric(attacks$Time)

# since we removed all the letters, hours now are not written as 8h00 or 15h30 but as 800 and 15h00. this is
# great for us, so that we can easily create a function that classifies those numbers as parts of the day. the 
#function works this way: if a value is included between 0 and 500 (i.e. midnight and 5a.m.), the value is replaced
#by the word "night" etc.

timeoftheday <- function(time) {
  if (!is.na(time)) {
    if (time >= 0 && time < 500) {
      return("night")
    } else if (time >= 500 && time < 1200) {
      return("morning")
    } else if (time >= 1200 && time < 1730) {
      return("afternoon")
    } else if (time >= 1730 && time < 2400) {
      return("evening")
    } else {
      return("")  # Handle any other cases (if necessary)
    }
  } else {
    return(NA)  # Retain NA values
  }
}

attacks$Time <- sapply(attacks$Time, timeoftheday)
attacks$Time <- tolower(attacks$Time)
attacks$Time <- na_if(attacks$Time, "")


sum(is.na(attacks$Time)) #this is the only one that still presents 1415 NA. we dont want to delete
#them all coz we'd lose 44% of our data. 
table(attacks$Time)#table shows that most of them happen in the afternoon, while 2ns place is owned by
#morning. To confirm the higher frequency of attacks between 8am and 6pm is this artile (link JC found??)
#we explain this by the fact that, naturally, most of people swim during the day. therefore, what we do
#is replacing NA randomly by the proportion of afternoon, morning and evening.

sum(!is.na(attacks$Time))

#create function that substitutes the NA in time with one of the 4 parts of the day, based on their
#proportion presence in our dataset

non_na_time_proportions <- table(attacks$Time) / sum(!is.na(attacks$Time))


# Identify NA positions
position_of_na <- is.na(attacks$Time)

# Generate random values based on proportions
attacks$Time[position_of_na] <- sample(
  names(non_na_time_proportions),
  sum(position_of_na),
  replace = TRUE,
  prob = as.vector(non_na_time_proportions)
)


#_____________________________________________________________________________________________
#NOW WE WORK ON SEX

attacks$Sex <- gsub("lli", "", attacks$Sex)
attacks$Sex <- gsub("M ", "M", attacks$Sex)
attacks$Sex <- na_if(attacks$Sex, "")
attacks$Sex <- ifelse(is.na(attacks$Sex), "Unknown", attacks$Sex)

#__________________________________________________________________________________________________________________


#NOW WE WORK ON SHARK SPECIES


#Creation of a variable species where we delete all the numbers that concern the size of the shark
attacks$Species <- str_remove_all(attacks$Species, "[[:digit:]]")

#If there is an empty cell, we will put NA
species = na_if(attacks$Species, "")

#gsub to take off punctuation
species <- gsub("[[:punct:]]", "", species)

#Delete all the numbers followed by cm or m
species <- gsub("\\d+\\s*(cm|m)\\b", "", species)

#Delete word composed by 1 or 2 letters
species <- gsub("\\b\\w{1,2}\\b", "", species)

#Delete all the numbers
species <- gsub("\\d", "", species)

#Delete all unities as lb, kg or l
species <- gsub("\\b(kg|lb|b)\\b", "", species)

#If the word shark appears more than once in each cell, we only write shark once
species <- gsub("\\bshark\\b(.*\\bshark\\b)?", "shark", species, ignore.case = TRUE)


# Rewrite shark species names in one way, and do that for each species
species<- gsub("^.*White shark.*$", "White shark", species)
species<- gsub("^.*whitetip shark.*$", "White shark", species)
species<- gsub("^.*white shark.*$", "White shark", species)
species<- gsub("^.*Wobbegong shark.*$", "Wobbegong shark", species)
species<- gsub("^.*Wobbegong.*$", "Wobbegong shark", species)
species<- gsub("^.*Zambesi shark.*$", "Zambesi shark", species)
species<- gsub("^.*Zambezi shark.*$", "Zambesi shark", species)
species<- gsub("^.*whaler shark.*$", "Whale shark", species)
species<- gsub("^.*whale shark.*$", "Whale shark", species)
species<- gsub("^.*tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*Tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*thresher shark.*$", "Thresher shark", species)
species<- gsub("^.*spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark feet.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*spinner  blacktip shark.*$", "Spinner shark", species)
species<- gsub("^.*Tawny nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Mako shark.*$", "Mako shark", species)
species<- gsub("^.*mako shark.*$", "Mako shark", species)
species<- gsub("^.*Lemon shark.*$", "Lemon shark", species)
species<- gsub(".*lemon shark.*", "Lemon shark", species, ignore.case = TRUE)
species<-gsub("^\\s*shark\\s*$", "Unidentified shark", species)
species<- gsub(".*bull.*", "Bull shark", species, ignore.case = TRUE)
species<- gsub(".*blue.*", "Blue shark", species, ignore.case = TRUE)
species<- gsub(".*reef.*", "Reef shark", species, ignore.case = TRUE)
species<- gsub(".*sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("^.*Sand shark.*$", "Sand shark", species)
species<- gsub(".*Sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*sandshark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*Sandbar shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("juvenile\\s+\\w+", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark  blacktip shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("blacktip\\s+\\w+", "Blacktip shark",species, ignore.case = TRUE)
species <- gsub("\\black\\w*\\b", "Blacktip shark", species, ignore.case = TRUE)

#Replace all occurrences of the word "blacktip" in a column with "Blacktip Shark," even if the word is accompanied by other words before or after

species <- gsub("\\bblacktip\\b", "Blacktip Shark", species, ignore.case = TRUE)
species<- gsub(".*copper shark*", "Copper shark", species, ignore.case = TRUE)
species <- gsub("\\bcow\\b", "Cow", species)
species <- gsub("\\bsilky\\b", "Silky", species)
species <- gsub("\\bsilvertip\\b", "Silvertip", species)


#Condition: if "Hammerhead" followed by other words then replaced by Hammerhead shark
species <- ifelse(grepl("Hammerhead\\s+\\w+", species, ignore.case = TRUE), "Hammerhead shark", species)
species <- ifelse(grepl("Blacktip\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Raggedtooth\\s+\\w+", species, ignore.case = TRUE), "Raggedtooth shark", species)
species <- ifelse(grepl("Porbeagle\\s+\\w+", species, ignore.case = TRUE), "Porbeagle shark", species)


#if the word shark does not appear then NA
species <- ifelse(grepl("shark", species, ignore.case = TRUE), species, NA)
species <- ifelse(grepl("Juvenile\\s+\\w+", species, ignore.case = TRUE), "Juvenile shark", species)
species <- ifelse(grepl("grey\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("greycolored\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("gray\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("Broadnose\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("7gill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("sevengill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("black\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Sand\\s+\\w+", species, ignore.case = TRUE), "Sand shark", species)
species <- ifelse(grepl("Carpet\\s+\\w+", species, ignore.case = TRUE), "Carpet shark", species)
species <- ifelse(grepl("\\bbrown\\b", species, ignore.case = TRUE), "Brown shark", species)

species <- gsub("^frag\\w+", "Unidentified shark", species)
species <- gsub("^shark$", "Unidentified shark", species)

#Function to check if "small" is present in cell
contains_small <- function(text) {
  return(grepl("small", text))
}
# Replace cell with "unidentified shark" if "small" is present
species <- ifelse(sapply(species, contains_small), "Unidentified shark", species)

#Function to check if "Small" is present in cell
contains_small <- function(text) {
  return(grepl("Small", text))
}
#  Replace cell with "unidentified shark" if "Small" is present
species <- ifelse(sapply(species, contains_small), "Unidentified shark", species)

#Function to check if "sharks" is present in cell
contains_sharks <- function(text) {
  return(grepl("sharks", text))
}
# Replace cell with"unidentified shark" if "sharks" is present
species <- ifelse(sapply(species, contains_sharks), "Unidentified shark", species)

#Function to check if "cookie" is present in cell
contains_cookie <- function(text) {
  return(grepl("cookie", text))
}
# Replace cell with"Cookiecutter shark" if "cookie" is present
species <- ifelse(sapply(species, contains_cookie), "Cookiecutter shark", species)


#Function to check if "involvement" is present in cell
contains_involvement <- function(text) {
  return(grepl("involvement", text))
}
# Replace cell with"No shark" if "involvement" is present
species <- ifelse(sapply(species, contains_involvement), "No shark", species)

#Function to check if "invovlement" is present in cell
contains_invovlement <- function(text) {
  return(grepl("invovlement", text))
}
# Replace cell with"No shark" if "invovlement" is present
species <- ifelse(sapply(species, contains_invovlement), "No shark", species)

#Function to check if "Not" is present in cell
contains_Not <- function(text) {
  return(grepl("Not", text))
}
# Replace cell with"No shark" if "Not" is present
species <- ifelse(sapply(species, contains_Not), "No shark", species)

#Function to check if "not" is present in cell
contains_not <- function(text) {
  return(grepl("not", text))
}
# Replace cell with"No shark" if "not" is present
species <- ifelse(sapply(species, contains_not), "No shark", species)

#Function to check if "Questionable" is present in cell
contains_Questionable <- function(text) {
  return(grepl("Questionable", text))
}
# Replace cell with"No shark" if "Questionable" is present
species <- ifelse(sapply(species, contains_Questionable), "No shark", species)

#Function to check if "questionable" is present in cell
contains_questionable <- function(text) {
  return(grepl("questionable", text))
}
# Replace cell with"No shark" if "questionable" is present
species <- ifelse(sapply(species, contains_questionable), "No shark", species)


#Function to check if "Salmon" is present in cell
contains_Salmon <- function(text) {
  return(grepl("Salmon", text))
}
# Replace cell with"Salmon shark" if "Salmon" is present
species <- ifelse(sapply(species, contains_Salmon), "Salmon shark", species)


#Function to check if "whaler" is present in cell
contains_whaler <- function(text) {
  return(grepl("whaler", text))
}
# Replace cell with"Whale shark" if "involvement" is present
species <- ifelse(sapply(species, contains_whaler), "Whale shark", species)


# Replace the cell with "Unidentified shark" if any of the specified words are found
species <- ifelse(grepl("(seen|observed|Tooth|tooth|large|killed|captive|female|metre|foot|followed|caused|Said|young|probably|gaffed)", species, ignore.case = TRUE), "Unidentified shark", species)

# Replace the cell with "No shark" if any of the specified words are found
species <- ifelse(grepl("(hoax|No Shark)", species, ignore.case = TRUE), "No shark", species)

# Replace the cell with "Copper shark" if any of the specified words are found
species <- ifelse(grepl("(Copper)", species, ignore.case = TRUE), "Copper shark", species)

# Replace the cell with "Dogfish shark" if any of the specified words are found
species <- ifelse(grepl("(Dog|dogfish)", species, ignore.case = TRUE), "Dogfish shark", species)

# Replace the cell with "Dusky shark" if any of the specified words are found
species <- ifelse(grepl("(Dusky)", species, ignore.case = TRUE), "Dusky shark", species)

# Replace the cell with "Sevengill shark" if any of the specified words are found
species <- ifelse(grepl("(gill)", species, ignore.case = TRUE), "Sevengill shark", species)

# Replace the cell with "Angel shark" if any of the specified words are found
species <- ifelse(grepl("(Angel)", species, ignore.case = TRUE), "Angel shark", species)


attacks$Species <- species
attacks$Species <- ifelse(is.na(attacks$Species), "Unidentifies shark", attacks$Species)

#_____________________________________________________________________________________________

#ACTIVITY

#when trying to put everything in lower cap, this was the result: Errore in tolower(attacks$Activity) :  multibyte string 921 not valid --> therefore, we converted everything in ASCII (=American Standard Code for Information Interchange)

attacks$Activity <- iconv(attacks$Activity, to = "ASCII", sub = " ")
attacks$Activity <- gsub("[^ -~]", "", attacks$Activity)

attacks$Activity <- tolower(attacks$Activity)

#when running a table of all activities, we can see that we can group them in some categories: diving, race, windsurfing, walking, wading, wade fishing, touching, swimming, surfing, surf, standing, spearfishing, snorkeling,  scuba diving, playing, paddle, murder, kayaking, kayak, floating, fishing, so that we have standard categories.

attacks$Activity <- gsub("[^A-Za-z ]", "", attacks$Activity) 


kept_activities <- c("diving", "race", "windsurfing", "walking", "wading", "wade fishing", "touching", "swimming", "surfing", "surf", "standing", "spearfishing", "snorkeling", "scuba diving", "playing", "paddle", "murder", "kayaking", "kayak", "floating", "fishing")


# We create an expression pattern that matches any of the specific words
pattern <- paste(kept_activities, collapse = "|")

# Extract only the specific words (kept_activities) from the column and replace the rest with an empty string
attacks$Activity <- str_extract(attacks$Activity, paste0("\\b(?:", pattern, ")\\b"))

attacks <- attacks %>%
  mutate(Activity = str_replace_all(Activity, "\\bkayaking\\b", "kayak")) %>%
  mutate(Activity = str_replace_all(Activity, "\\bsurfing\\b", "surf"))

attacks$Activity[is.na(attacks$Activity)] <- "other" # for all activities that are not part of the list or for NA, we just put Other, since no information is available


#_____________________________________________________________________________________________

#FATAL 
 #When we run a table for the Fatality column, we see that almost observation follow the same category (Y = yes, N = No, Unknwn). An observation has a 2017, which is probably a typo (which we replace by empty case) while another one has an M. We want to believe that it was a typo as well, and that "M" was typed instead of "N".
attacks$Fatal..Y.N. <- gsub("2017", "", attacks$Fatal..Y.N.)
attacks$Fatal..Y.N. <- gsub("M", "N", attacks$Fatal..Y.N.)

attacks$Fatal..Y.N. <-na_if(attacks$Fatal..Y.N., "")

#For the rest of NA, we have no information, so we took of these few observations.
attacks <- subset(attacks, !is.na(Fatal..Y.N.))
#Replace Yes by 1, 0 otherwise
attacks$Fatal..Y.N. <- ifelse(attacks$Fatal..Y.N. == "Y", 1, 0)

names(attacks)[names(attacks) == "Fatal..Y.N."] <- "Fatality"


#_____________________________________________________________________________________________
#final check up:

#We run a table for all columns, in order to check if columns have any missing value or if we still have other mistakes. 

table(attacks$Date) #this one is fine
table(attacks$Year) #this one is fine
table(attacks$Type) #here we have boating and boat which mean the same thing. Let us group them
attacks$Type <- gsub("Boating", "Boat", attacks$Type)
attacks$Type <- gsub("Boatomg", "Boat", attacks$Type)
table(attacks$Type) #this one is fine
table(attacks$Country)#this one is fine
table(attacks$Activity)#this one is fine
table(attacks$Age)#this one is fine
table(attacks$Fatality)#this one is fine
table(attacks$Species)#this one is fine


#With the following code, we will add a new column which will associate each country with its ISO code. This is a very important step for the cleaning if the enxt sections, because il will allow us to match the different datasets. 

# Get ISO country codes
iso_codes <- countrycode(attacks$Country, "country.name", "iso3c")
attacks$ISO_Code <- countrycode(attacks$Country, "country.name", "iso3c")

#Through the following function we can spot NAs. We can see that there are 23 countries with no ISO Code. We will fix them by hand. 
rows_with_na <- which(is.na(attacks$ISO_Code))

attacks$ISO_Code[attacks$Country %in% c("ENGLAND", "SCOTLAND", "BRITISH ISLES") & is.na(attacks$ISO_Code)] <- "GB"
attacks$ISO_Code[attacks$Country %in% c("AZORES") & is.na(attacks$ISO_Code)] <- "PRT"
attacks$ISO_Code[attacks$Country %in% c("ST. MAARTIN", "ST. MARTIN") & is.na(attacks$ISO_Code)] <- "MAF"
attacks$ISO_Code[attacks$Country %in% c("OKINAWA") & is.na(attacks$ISO_Code)] <- "JPN"
attacks$ISO_Code[attacks$Country %in% c("MICRONESIA") & is.na(attacks$ISO_Code)] <- "FSM"
attacks <- subset(attacks, !is.na(ISO_Code))

DT::datatable(attacks, options = list(pageLength = 10))
<<<<<<< Updated upstream
=======
>>>>>>> Stashed changes

3.2 Temperatures Cleaning

The main mission for this dataset was to change the entire structure of the dataset. Originally, the dataset had the years as columns (one column for each year), while the rows were represented by the different countries; the temperature observations were presented by month, quarter and year. Furthermore, for each country and each month/quarter/year, the dataset shows both the change in temperatures and the standard deviation.

To proceed with the cleaning, we eliminated the rows corresponding to the standard deviation (we will not work with it), and the monthly and quarterly observations: we are only interested in yearly temperature variations.

We then rotated the dataset so as to have all the years under a single column; we subsequently matched each country and year with its temperature change observation. Finally we added the ISO code to merge with the other datasets.


temperature <- read_xlsx(here::here("data/Temperature.xlsx"))

#R did not read the "°C" on the column Unit, so we changed it manually

temperature <- temperature %>% mutate(Unit = "°C")

#We take off all the columns that we do not need for our study

temperature <- temperature %>% select(-'Area Code (M49)', -'Months Code', -'Element Code')

#We wliminate all columns having an F at the end (For each year there are 2 columns (ex: 1961 - 1961F; 1962 - 1962F). The column with the F at the end of the Year is a column where only the letter E appears for each row. We tried to look for its meaning on the FAO website but we did not succeed. Therefore, we removed it. What is more, it is important to notice that it was in any case unimportant for our work).

columns_to_remove <- grep("F$", names(temperature), value = TRUE)
temperature <- temperature[, !(names(temperature) %in% columns_to_remove)]

# We keep only meteorological year, we don't need to work with monthly and seasonal data.

target_name <- "Meteorological year"
temperature <- temperature[temperature$Months == target_name, ]

target_name2 <- "Temperature change"
temperature <- temperature[temperature$Element == target_name2, ]



#  We take off the rest of the columns that we do not need

temperature <- temperature %>% select(-'Area Code', -'Months', -'Element', -'Unit')

#The column of each year starts with an Y (Y1961, Y1961...). We take it off, so that we can easily match data with other datasets.

new_colnames <- gsub("Y", "", colnames(temperature))
colnames(temperature) <- new_colnames
temperature <- na.omit(temperature)


# We transpose the dataset so that columns become rows. With this function we have a single column, where each row corresponds to a year, but countries became the columns now. We will work on this later in the code.
temperature <- t(temperature)

# I want the first row to be the header

colnames(temperature) <- temperature[1, ]
clean_temperature <- temperature[-1, ]

#col names in CAPS
colnames(clean_temperature) <- toupper(colnames(clean_temperature))


library(tidyr)

# Convert the matrix/array "clean_temperatures" to a data frame
temperature <- as.data.frame(clean_temperature)

# Add 'Year' as a separate column
temperature$Year <- rownames(clean_temperature)

# Reshape the data from wide to long format
temperature <- tidyr::pivot_longer(temperature, 
                                     cols = -Year, 
                                     names_to = "Country", 
                                     values_to = "Temperature")

# Reorder columns as per your desired format
temperature <- temperature[c("Year", "Country", "Temperature")]

#As we said at the beginning, we only work with data from 1970
temperature <- temperature %>% filter(Year >= 1970)
temperature$Year <- as.numeric(temperature$Year)

temperature <- temperature %>% 
  mutate(Country = ifelse(Country == "UNITED STATES OF AMERICA", "USA", Country))

#As for the attacks dataset, we add a column for the ISO code.

temperature$ISO_Code <- countrycode(temperature$Country, "country.name", "iso3c")
temperature$ISO_Code[temperature$Country %in% c("MICRONESIA") & is.na(temperature$ISO_Code)] <- "FSM"

#Now that we put ISO code, we can remove the country column
temperature <- temperature %>% select(-'Country')

DT::datatable(temperature, options = list(pageLength = 10))
<<<<<<< Updated upstream
=======
>>>>>>> Stashed changes

3.3 Sea Level Cleaning

We choose a data set about the sea level change observed during the last years. We start analyzing the sea level changes since 1993 because we did not find any data of years before. In this data set we had few columns but we only kept 2 columns that were the more relevant for us: the year and the GMSL_GIA. This last column is the global mean of sea level including the glacial isostatic adjustement (GIA). We choose the column with GIA and not without because this show the response of solid Earth to the past changes in surface loading by ice and water such as the glaciation and deglaciation.


sealevel <- read.csv(here::here("data/sealevel.csv"))

# keep the 2 columns we are interested to analyze because the other are irrelevant for our project
sealevel <- subset(sealevel, select = c(Year, GMSL_GIA))

#Show the year only the first time by creating a new column called Year2
sealevel$Year2 <- ifelse(duplicated(sealevel$Year), NA, sealevel$Year)

# delete column Year due to the creation of column Year 2
sealevel <- subset(sealevel, select = -Year)

# delete NA because it does not bring anything to our analysis
sealevel <- na.omit(sealevel)

#Change name of column Year
column <- gsub("2","",colnames(sealevel))
colnames(sealevel) <- column

DT::datatable(sealevel, options = list(pageLength = 10))
<<<<<<< Updated upstream
=======
>>>>>>> Stashed changes

3.4 Greenhouse Gas Emissions Cleaning

Finally, the GHG emissions dataset was the easiest to clean. Indeed, we already have a column for the ISO code, so we simply had to filter the years so that the observations began in 1970.

co2 <- read.csv(here::here("data/CO2.csv"))

# Change names of columns in order to have the same columns that in the other datasets
names(co2)[names(co2) == "Entity"] <- "Country"
names(co2)[names(co2) == "Annual.CO..emissions"] <- "Annual CO2 Emissions"
names(co2)[names(co2) == "Code"] <- "ISO_Code"



# keep the 3 columns we are interested to analyze
co2 <- subset(co2, select = c("ISO_Code", "Year", "Annual CO2 Emissions"))


# only keep information starting in 1970 because we want to look at the evolution of 
# the last 50 years
co2<- co2[co2$Year >= 1970, ]

co2$ISO_Code <- na_if(co2$ISO_Code, "")

co2 <- subset(co2, !is.na(ISO_Code))

DT::datatable(co2, options = list(pageLength = 10))
<<<<<<< Updated upstream
=======
>>>>>>> Stashed changes

3.5 Merged Data Set

After cleaning the four datasets, we have finally merged them. This step is essential to be able to run regression across variables that came from multiple datasets.



#MERGE FIRST TWO DATA SETS

# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data <- left_join(attacks, temperature, by = c("Year", "ISO_Code"))

#MERGE WITH THIRD

# Assuming 'Year' is the columns in both datasets for matching
merged_data2 <- left_join(merged_data, sealevel, by = c("Year"))

#MERGE WITH FOURTH

# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))
merged_data3$Temperature <- as.numeric(as.character(merged_data3$Temperature))
# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))

#Change name of a column
colnames(merged_data3)[colnames(merged_data3) == 'Country.x'] <- 'Country'
merged_data3$Temperature <- as.numeric(merged_data3$Temperature)


#for merged_data3, which we will use for our regressions, we need numerical variables
#therefore, we will make changes in some categorical ones

#TYPE

#Let's transform the Type variable on numeric. 1 corresponds to Boat, 2 to Unprovoked, Invalid to 3
# Provoked to 4, Questionable to 5 and Sea Disaster to 6
categories <- c("Boat" = 1, "Unprovoked" = 2, "Invalid" = 3, "Provoked" = 4, "Questionable" = 5, "Sea Disaster" = 6)
merged_data3$Type <- factor(merged_data3$Type, levels = names(categories))
merged_data3$Type <- as.numeric(merged_data3$Type)

#TIME

# Let's focus on the transformation of time where morning correspond to 0, afternoon to 1,
# evening to 2 and night to 3
merged_data3$Time <- as.numeric(factor(merged_data3$Time, levels = c("morning", "afternoon", "evening", "night")))


#SEX

merged_data3$Sex <- ifelse(merged_data3$Sex == "M", 0, 
                      ifelse(merged_data3$Sex == "F", 1, 2))


#SPECIES


merged_data3 <- merged_data3 %>%
  mutate(Species = as.numeric(factor(Species)))

# Get the top 5 species
top_species <- merged_data3 %>%
  count(Species, sort = TRUE) %>%
  slice_head(n = 5)

# Create a logical condition to include only the top 5 species
condition <- merged_data3$Species %in% top_species$Species

# Subset the data based on the condition
species_filtered_data <- merged_data3[condition, ]



DT::datatable(merged_data3, options = list(pageLength = 10))
<<<<<<< Updated upstream

Interactive map

#> # A tibble: 69 x 2
#>    Country    Attackscountry
#>    <chr>               <int>
#>  1 ARUBA                   1
#>  2 AUSTRALIA             520
#>  3 BAHAMAS                77
#>  4 BRAZIL                 96
#>  5 CANADA                  1
#>  6 CHILE                   1
#>  7 CHINA                   4
#>  8 COLOMBIA                3
#>  9 COSTA RICA              5
#> 10 CROATIA                 3
#> # i 59 more rows

5. Analysis

5.1 RQ1: Which are the primary factors influencing the occurrence of shark attacks?

After a deep Exploratory Data Analysis we thought about the relevance of some factors that might influence directly the occurrence of shark attacks. Directly in the sense that these are variables that were found in the main data set and not in a secondary one. Let’s begin then:

Before doing any regression, we will compute the correlation matrix in order to analyze our numerical variables

#>             Year    Type     Sex      Age     Time  Species
#> Year     1.00000  0.0280 -0.0546  0.09953  0.00641 -0.07148
#> Type     0.02797  1.0000 -0.1276  0.03761  0.03376 -0.10931
#> Sex     -0.05457 -0.1276  1.0000 -0.04967  0.01828  0.01406
#> Age      0.09953  0.0376 -0.0497  1.00000 -0.10560 -0.00215
#> Time     0.00641  0.0338  0.0183 -0.10560  1.00000  0.00836
#> Species -0.07148 -0.1093  0.0141 -0.00215  0.00836  1.00000
  • Thanks to this correlation matrix, we can affirm that: o Year has a slight positive correlation (0.13) with Age. The older the individual, there is a slight increase in the likelihood of being tied up by a shark, probably due to the activity the individual was doing such as surfing or swimming over time. In another side, there is a negative correlation (-0.1) with Species. o Age has a slightly positive correlation (0.11) with Time almost. o The variable Type has a slight positive correlation with Age and Species while it has a slight negative correlation with Year, Species and Sex. o There is a negative correlation between Year and Species. o The following variables have no significant correlation with the other numeric variables, due to the value of the coefficient close to 0: Sex and Type.

The regression that we will use is the following one:

\[ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \epsilon \]

where:

  1. \(\hat{Y}\) = Frequency of Shark Attacks
  2. \(X_{\text{1}}\) = Year
  3. \(X_{\text{2}}\) = Sex
  4. \(X_{\text{3}}\) = Age
  5. \(X_{\text{4}}\) = Species 6.\(X_{\text{5}}\) = Type
  6. \(X_{\text{6}}\) = Time
  7. \(\epsilon\) = Residual Error

Let’s do our main regression based on all the countries

#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type + 
#>     Time, data = merged_map)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -1208   -561    350    529   1219 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -6116.783   1764.298   -3.47  0.00053 ***
#> Year            3.716      0.879    4.23  2.4e-05 ***
#> Sex           -33.807     20.081   -1.68  0.09239 .  
#> Age            -9.556      0.855  -11.18  < 2e-16 ***
#> Species        -2.017      1.330   -1.52  0.12949    
#> Type          -45.688     15.115   -3.02  0.00253 ** 
#> Time            6.008     16.467    0.36  0.71525    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 603 on 2894 degrees of freedom
#> Multiple R-squared:  0.0509, Adjusted R-squared:  0.049 
#> F-statistic: 25.9 on 6 and 2894 DF,  p-value: <2e-16
Regression on the main factors influencing shark attacks
Dependent variable:
Attackscountry
Year 3.720***
(0.879)
Sex -33.800*
(20.100)
Age -9.560***
(0.855)
Species -2.020
(1.330)
Type -45.700***
(15.100)
Time 6.010
(16.500)
Constant -6,117.000***
(1,764.000)
Observations 2,901
R2 0.051
Adjusted R2 0.049
Residual Std. Error 603.000 (df = 2894)
F Statistic 25.900*** (df = 6; 2894)
Note: p<0.1; p<0.05; p<0.01
  • We just did a linear regression where the response variable is Attackscountry predicted by the following variables: Year, Sex, Age, Time, Type and Species.
  • The intercept represents the predicted value of Attackscountry when almost all variables are zero except Age that could not be zero.
  • The variable Year is statistically significant (p-value < 0.05), indicating that there is a significant relationship between the Year and the number of shark attacks. The variable Type is statistically significant at the 0.1 level.
  • The variable Age is also statistically significant at the same level (0.1%).
  • In the other side, the variables that are not statistically significant at this level are: Sex and Species.
  • The R-squared in this model is of 5.2 %. Thus, 5.2% of the variance in the response variable (Attackscountry) is explained by this model. The adjusted R-squared is similar, 5 %. This model does not explain a large portion of the variability of this variable.
  • Concerning the F-statistic we can say that there is a low p-value suggesting that at least one predictor variable is significantly related to the response variable.

This first regression will be based only on the USA

#> [1] "Regression for USA:"
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type + 
#>     Time, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014306 -0.0000000000001  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000041 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001689386    0.0000000001644062
#> Year          -0.0000000000001034    0.0000000000000819
#> Sex            0.0000000000012479    0.0000000000019505
#> Age            0.0000000000000542    0.0000000000000701
#> Species       -0.0000000000000568    0.0000000000001135
#> Type           0.0000000000000894    0.0000000000016119
#> Time          -0.0000000000002874    0.0000000000014769
#>                      t value            Pr(>|t|)    
#> (Intercept) 8996012496841.87 <0.0000000000000002 ***
#> Year                   -1.26                0.21    
#> Sex                     0.64                0.52    
#> Age                     0.77                0.44    
#> Species                -0.50                0.62    
#> Type                    0.06                0.96    
#> Time                   -0.19                0.85    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1472 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.498 
#> F-statistic:  245 on 6 and 1472 DF,  p-value: <0.0000000000000002
#>    Year     Sex     Age Species    Type    Time 
#>    1.03    1.02    1.03    1.03    1.04    1.01
Regression on the main factors influencing shark attacks focused on the USA
Dependent variable:
Attackscountry
Year -0.000
(0.000)
Sex 0.000
(0.000)
Age 0.000
(0.000)
Species -0.000
(0.000)
Type 0.000
(0.000)
Time -0.000
(0.000)
Constant 1,479.000***
(0.000)
Observations 1,479
R2 0.500
Adjusted R2 0.498
Residual Std. Error 0.000 (df = 1472)
F Statistic 245.000*** (df = 6; 1472)
Note: p<0.1; p<0.05; p<0.01
  • We focus on the country that has the most shark attacks in the world according to our main dataset.
  • None of the predictor variables are statistically significant predicting the number of shark attacks in the USA.
  • This model fits the data concerning USA very well, because of the high R-Squared value (0.5) and the small residuals. Does this model fit well or is there some multicollinearity?
  • According to the VIF analysis, the values are low. As they are below 5, we cannot say that there is high multicollinearity between some variables because here all the values are close to 1. Each variable contributes relatively independently to explaining the variability of Attackscountry.
#> Variable added: Year  | AIC: -66827 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014319  0.0000000000003  0.0000000000012  0.0000000000019 
#>              Max 
#>  0.0000000000025 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001439275    0.0000000001614318
#> Year          -0.0000000000000911    0.0000000000000807
#>                      t value            Pr(>|t|)    
#> (Intercept) 9161765526064.88 <0.0000000000000002 ***
#> Year                   -1.13                0.26    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1477 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:   0.5 
#> F-statistic: 1.48e+03 on 1 and 1477 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Sex  | AIC: -66826 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014315  0.0000000000002  0.0000000000012  0.0000000000019 
#>              Max 
#>  0.0000000000028 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001498393    0.0000000001618005
#> Year          -0.0000000000000943    0.0000000000000809
#> Sex            0.0000000000011236    0.0000000000019330
#>                      t value            Pr(>|t|)    
#> (Intercept) 9140887108111.84 <0.0000000000000002 ***
#> Year                   -1.17                0.24    
#> Sex                     0.58                0.56    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1476 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.499 
#> F-statistic:  738 on 2 and 1476 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Age  | AIC: -66824 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014309  0.0000000000000  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000038 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001623448    0.0000000001625604
#> Year          -0.0000000000001011    0.0000000000000813
#> Sex            0.0000000000012543    0.0000000000019402
#> Age            0.0000000000000553    0.0000000000000695
#>                      t value            Pr(>|t|)    
#> (Intercept) 9098156172803.24 <0.0000000000000002 ***
#> Year                   -1.24                0.21    
#> Sex                     0.65                0.52    
#> Age                     0.80                0.43    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1475 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.499 
#> F-statistic:  492 on 3 and 1475 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Species  | AIC: -66823 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014307 -0.0000000000001  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000041 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001696208    0.0000000001632457
#> Year          -0.0000000000001039    0.0000000000000815
#> Sex            0.0000000000012481    0.0000000000019407
#> Age            0.0000000000000558    0.0000000000000696
#> Species       -0.0000000000000577    0.0000000000001121
#>                      t value            Pr(>|t|)    
#> (Intercept) 9059961123428.15 <0.0000000000000002 ***
#> Year                   -1.27                0.20    
#> Sex                     0.64                0.52    
#> Age                     0.80                0.42    
#> Species                -0.51                0.61    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1474 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.499 
#> F-statistic:  368 on 4 and 1474 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Time  | AIC: -66821 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time, 
#>     data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014306 -0.0000000000001  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000041 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001700755    0.0000000001633070
#> Year          -0.0000000000001038    0.0000000000000816
#> Sex            0.0000000000012382    0.0000000000019420
#> Age            0.0000000000000544    0.0000000000000700
#> Species       -0.0000000000000577    0.0000000000001122
#> Time          -0.0000000000002849    0.0000000000014757
#>                      t value            Pr(>|t|)    
#> (Intercept) 9056559928503.87 <0.0000000000000002 ***
#> Year                   -1.27                0.20    
#> Sex                     0.64                0.52    
#> Age                     0.78                0.44    
#> Species                -0.51                0.61    
#> Time                   -0.19                0.85    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1473 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.498 
#> F-statistic:  295 on 5 and 1473 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Type  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time + 
#>     Type, data = usa_data)
#> 
#> Residuals:
#>              Min               1Q           Median               3Q 
#> -0.0000000014306 -0.0000000000001  0.0000000000011  0.0000000000021 
#>              Max 
#>  0.0000000000041 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 1479.0000000001689386    0.0000000001644062
#> Year          -0.0000000000001034    0.0000000000000819
#> Sex            0.0000000000012479    0.0000000000019505
#> Age            0.0000000000000542    0.0000000000000701
#> Species       -0.0000000000000568    0.0000000000001135
#> Time          -0.0000000000002874    0.0000000000014769
#> Type           0.0000000000000894    0.0000000000016119
#>                      t value            Pr(>|t|)    
#> (Intercept) 8996012496841.87 <0.0000000000000002 ***
#> Year                   -1.26                0.21    
#> Sex                     0.64                0.52    
#> Age                     0.77                0.44    
#> Species                -0.50                0.62    
#> Time                   -0.19                0.85    
#> Type                    0.06                0.96    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.0000000000373 on 1472 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.498 
#> F-statistic:  245 on 6 and 1472 DF,  p-value: <0.0000000000000002

In order to have a better model by doing a forward selection where we will take the model with the lowest AIC.

In order to have a better model by doing a forward selection where we will take the model with the lowest AIC.

Here we have the AIC values for each model: 1. Model with Year only: AIC = -66827.49 2. Model with Year and Sex: AIC = -66825.83 3. Model with Year, Sex, and Age: AIC = -66824.46 4. Model with Year, Sex, Age, and Species: AIC = -66822.72 5. Model with Year, Sex, Age, Species and Time: AIC = -66820.77 6. Model with Year, Sex, Age, Species, Time and Type: AIC = -66818.77

As said previously, we chose the model with the lowest AIC indicating a better fit of the model, but we decided to choose it because the model includes more variables than the others, thus best explaining the variability in the data.

This second regression will be based only on Australia

#> [1] "Regression for AUSTRALIA:"
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type + 
#>     Time, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012389  0.00000000000007  0.00000000000024 
#>                3Q               Max 
#>  0.00000000000051  0.00000000000062 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999993406163   0.000000000037820421
#> Year          0.000000000000000107   0.000000000000018791
#> Sex           0.000000000000195984   0.000000000000413708
#> Age           0.000000000000003260   0.000000000000020697
#> Species      -0.000000000000004605   0.000000000000032269
#> Type          0.000000000000052540   0.000000000000292085
#> Time          0.000000000000332463   0.000000000000342080
#>                       t value            Pr(>|t|)    
#> (Intercept) 13749186062032.52 <0.0000000000000002 ***
#> Year                     0.01                1.00    
#> Sex                      0.47                0.64    
#> Age                      0.16                0.87    
#> Species                 -0.14                0.89    
#> Type                     0.18                0.86    
#> Time                     0.97                0.33    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000548 on 513 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.494 
#> F-statistic: 85.5 on 6 and 513 DF,  p-value: <0.0000000000000002
Regression on the main factors influencing shark attacks focused on AUSTRALIA
Dependent variable:
Attackscountry
Year 0.000
(0.000)
Sex 0.000
(0.000)
Age 0.000
(0.000)
Species -0.000
(0.000)
Type 0.000
(0.000)
Time 0.000
(0.000)
Constant 520.000***
(0.000)
Observations 520
R2 0.500
Adjusted R2 0.494
Residual Std. Error 0.000 (df = 513)
F Statistic 85.500*** (df = 6; 513)
Note: p<0.1; p<0.05; p<0.01
  • We focus on the second country that has the most shark attacks in the world according to our main dataset.
  • None of these variables are statistically significant. The residual standard error is very small and the multiple R-squared is approximately 50%. It indicates that there is a good fit and the variability is explained by 50% by the model, respectively.
#> Variable added: Year  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012420  0.00000000000023  0.00000000000025 
#>                3Q               Max 
#>  0.00000000000026  0.00000000000027 
#> 
#> Coefficients:
#>                          Estimate            Std. Error
#> (Intercept) 519.99999999999772626   0.00000000003604512
#> Year         -0.00000000000000173   0.00000000000001800
#>                      t value            Pr(>|t|)    
#> (Intercept) 14426365221367.6 <0.0000000000000002 ***
#> Year                    -0.1                0.92    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000546 on 518 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.499 
#> F-statistic:  517 on 1 and 518 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Sex  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012415  0.00000000000027  0.00000000000029 
#>                3Q               Max 
#>  0.00000000000029  0.00000000000030 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999995338840   0.000000000036493311
#> Year         -0.000000000000000546   0.000000000000018213
#> Sex           0.000000000000176122   0.000000000000402815
#>                       t value            Pr(>|t|)    
#> (Intercept) 14249186641170.37 <0.0000000000000002 ***
#> Year                    -0.03                0.98    
#> Sex                      0.44                0.66    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000547 on 517 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.498 
#> F-statistic:  259 on 2 and 517 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Age  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012415  0.00000000000027  0.00000000000029 
#>                3Q               Max 
#>  0.00000000000029  0.00000000000030 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999995338840   0.000000000036790730
#> Year         -0.000000000000000566   0.000000000000018402
#> Sex           0.000000000000176164   0.000000000000403241
#> Age           0.000000000000000161   0.000000000000020290
#>                       t value            Pr(>|t|)    
#> (Intercept) 14133995074177.91 <0.0000000000000002 ***
#> Year                    -0.03                0.98    
#> Sex                      0.44                0.66    
#> Age                      0.01                0.99    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000547 on 516 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.497 
#> F-statistic:  172 on 3 and 516 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Species  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012414  0.00000000000021  0.00000000000030 
#>                3Q               Max 
#>  0.00000000000031  0.00000000000034 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999996248334   0.000000000037075547
#> Year         -0.000000000000000868   0.000000000000018490
#> Sex           0.000000000000180007   0.000000000000404139
#> Age           0.000000000000000389   0.000000000000020345
#> Species      -0.000000000000005997   0.000000000000032010
#>                       t value            Pr(>|t|)    
#> (Intercept) 14025416977620.41 <0.0000000000000002 ***
#> Year                    -0.05                0.96    
#> Sex                      0.45                0.66    
#> Age                      0.02                0.98    
#> Species                 -0.19                0.85    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000548 on 515 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.496 
#> F-statistic:  129 on 4 and 515 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Time  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time, 
#>     data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012390  0.00000000000010  0.00000000000022 
#>                3Q               Max 
#>  0.00000000000051  0.00000000000062 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999994656719   0.000000000037106145
#> Year         -0.000000000000000473   0.000000000000018494
#> Sex           0.000000000000180390   0.000000000000404142
#> Age           0.000000000000003587   0.000000000000020597
#> Species      -0.000000000000005281   0.000000000000032018
#> Time          0.000000000000338769   0.000000000000339958
#>                       t value            Pr(>|t|)    
#> (Intercept) 14013851309421.74 <0.0000000000000002 ***
#> Year                    -0.03                0.98    
#> Sex                      0.45                0.66    
#> Age                      0.17                0.86    
#> Species                 -0.16                0.87    
#> Time                     1.00                0.32    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000548 on 514 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.495 
#> F-statistic:  103 on 5 and 514 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Type  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time + 
#>     Type, data = aus_data)
#> 
#> Residuals:
#>               Min                1Q            Median 
#> -0.00000000012389  0.00000000000007  0.00000000000024 
#>                3Q               Max 
#>  0.00000000000051  0.00000000000062 
#> 
#> Coefficients:
#>                           Estimate             Std. Error
#> (Intercept) 519.999999999993406163   0.000000000037820421
#> Year          0.000000000000000107   0.000000000000018791
#> Sex           0.000000000000195984   0.000000000000413708
#> Age           0.000000000000003260   0.000000000000020697
#> Species      -0.000000000000004605   0.000000000000032269
#> Time          0.000000000000332463   0.000000000000342080
#> Type          0.000000000000052540   0.000000000000292085
#>                       t value            Pr(>|t|)    
#> (Intercept) 13749186062032.53 <0.0000000000000002 ***
#> Year                     0.01                1.00    
#> Sex                      0.47                0.64    
#> Age                      0.16                0.87    
#> Species                 -0.14                0.89    
#> Time                     0.97                0.33    
#> Type                     0.18                0.86    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.00000000000548 on 513 degrees of freedom
#> Multiple R-squared:   0.5,   Adjusted R-squared:  0.494 
#> F-statistic: 85.5 on 6 and 513 DF,  p-value: <0.0000000000000002
#>    Year     Sex     Age Species    Type    Time 
#>    1.08    1.07    1.06    1.03    1.09    1.04
  • In this case, for each model we have the same AIC, that has a value of -66818.77. Increasing the number of variables that we add in the model does not seem to improve the significance of the model. As the AIC value is the same, we prefer to have as many variables as possible.
  • The model with all the variables will explain almost 50% of the variability. But this value is high, but according to the VIF analysis, there is no high multicollinearity between variables. We can conclude saying that this model fits well.

Let’s focus on both countries

#> [1] "Regression for USA and AUSTRALIA:"
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type + 
#>     Time, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -917   -496    190    250    603 
#> 
#> Coefficients:
#>             Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 3786.535   1512.587    2.50              0.01238 *  
#> Year          -1.123      0.753   -1.49              0.13616    
#> Sex          -23.274     17.490   -1.33              0.18343    
#> Age           -5.739      0.677   -8.48 < 0.0000000000000002 ***
#> Species       -4.165      1.103   -3.77              0.00017 ***
#> Type         -28.178     13.689   -2.06              0.03968 *  
#> Time          27.132     13.682    1.98              0.04750 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 410 on 1992 degrees of freedom
#> Multiple R-squared:  0.0514, Adjusted R-squared:  0.0485 
#> F-statistic:   18 on 6 and 1992 DF,  p-value: <0.0000000000000002
Regression on the main factors influencing shark attacks focused on the USA and AUSTRALIA
Dependent variable:
Attackscountry
Year -1.120
(0.753)
Sex -23.300
(17.500)
Age -5.740***
(0.677)
Species -4.170***
(1.100)
Type -28.200**
(13.700)
Time 27.100**
(13.700)
Constant 3,787.000**
(1,513.000)
Observations 1,999
R2 0.051
Adjusted R2 0.049
Residual Std. Error 410.000 (df = 1992)
F Statistic 18.000*** (df = 6; 1992)
Note: p<0.1; p<0.05; p<0.01
  • We focus on the two countries that have the most shark attacks in the world according to our main dataset.
  • Year and Sex are the two variables in this model that are not statistically significant while the others are, at different thresholds. Age and Species are statistically significant at 0.1%, while Time is statistically significant at 10%. The variable Type is statistically significant at 5%.
  • The variability of Attackscountry has a value of 5.25% (R^2) and is explained by the model. We conclude that this model is better explained by another model that contains more useful variables.
#> Variable added: Year  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -759   -684    244    259    275 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  4354.25    1520.51    2.86   0.0042 **
#> Year           -1.56       0.76   -2.06   0.0400 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 420 on 1997 degrees of freedom
#> Multiple R-squared:  0.00211,    Adjusted R-squared:  0.00161 
#> F-statistic: 4.22 on 1 and 1997 DF,  p-value: 0.04
#> 
#> Variable added: Sex  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -762   -682    242    261    291 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  4357.61    1520.77    2.87   0.0042 **
#> Year           -1.56       0.76   -2.06   0.0400 * 
#> Sex           -10.32      17.73   -0.58   0.5608   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 421 on 1996 degrees of freedom
#> Multiple R-squared:  0.00228,    Adjusted R-squared:  0.00128 
#> F-statistic: 2.28 on 2 and 1996 DF,  p-value: 0.103
#> 
#> Variable added: Age  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -838   -509    187    242    614 
#> 
#> Coefficients:
#>             Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept) 2938.621   1498.992    1.96                0.05 .  
#> Year          -0.771      0.750   -1.03                0.30    
#> Sex          -20.014     17.418   -1.15                0.25    
#> Age           -6.093      0.672   -9.06 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 412 on 1995 degrees of freedom
#> Multiple R-squared:  0.0417, Adjusted R-squared:  0.0403 
#> F-statistic: 28.9 on 3 and 1995 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Species  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -910   -501    194    245    623 
#> 
#> Coefficients:
#>             Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 3444.180   1501.479    2.29              0.02190 *  
#> Year          -0.959      0.750   -1.28              0.20072    
#> Sex          -19.307     17.369   -1.11              0.26646    
#> Age           -5.997      0.671   -8.94 < 0.0000000000000002 ***
#> Species       -3.877      1.095   -3.54              0.00041 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 411 on 1994 degrees of freedom
#> Multiple R-squared:  0.0477, Adjusted R-squared:  0.0458 
#> F-statistic:   25 on 4 and 1994 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Time  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time, 
#>     data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -907   -504    192    249    615 
#> 
#> Coefficients:
#>             Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 3380.836   1500.909    2.25              0.02440 *  
#> Year          -0.953      0.749   -1.27              0.20337    
#> Sex          -18.691     17.361   -1.08              0.28181    
#> Age           -5.839      0.676   -8.64 < 0.0000000000000002 ***
#> Species       -3.858      1.094   -3.53              0.00043 ***
#> Time          25.669     13.675    1.88              0.06065 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 411 on 1993 degrees of freedom
#> Multiple R-squared:  0.0494, Adjusted R-squared:  0.047 
#> F-statistic: 20.7 on 5 and 1993 DF,  p-value: <0.0000000000000002
#> 
#> Variable added: Type  | AIC: -66819 
#> 
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time + 
#>     Type, data = combined_data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>   -917   -496    190    250    603 
#> 
#> Coefficients:
#>             Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) 3786.535   1512.587    2.50              0.01238 *  
#> Year          -1.123      0.753   -1.49              0.13616    
#> Sex          -23.274     17.490   -1.33              0.18343    
#> Age           -5.739      0.677   -8.48 < 0.0000000000000002 ***
#> Species       -4.165      1.103   -3.77              0.00017 ***
#> Time          27.132     13.682    1.98              0.04750 *  
#> Type         -28.178     13.689   -2.06              0.03968 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 410 on 1992 degrees of freedom
#> Multiple R-squared:  0.0514, Adjusted R-squared:  0.0485 
#> F-statistic:   18 on 6 and 1992 DF,  p-value: <0.0000000000000002
  • Here we have the same AIC value for each model: -66818.77. As said previously, we decide to use the model containing the most variables.

We have done the following type of models of this first regression that will help us to answer this first research question: - Overall: all countries that we have on our merged dataset. - Focus on USA. - Focus on Australia. - Focus on the two countries above.

Each type of model according to their focus has its own primary factors influencing the occurrence of shark attacks. For example, in the first one (Overall), Age, Year, Type and Time are the factors that influence the attacks. But the two that are highly significant are Age and Year. In the 4th one, the one that is focused on both countries, the main factors are Age, Species, Type and Time. In conclusion, Age, Type and Time are highly significant predictors, thus they strongly influence the occurrence of shark attacks from a statistical point of view. We did not consider models focus on one country because they are not that relevant. We choose to analyze more than 1 country because it makes more sense. We analyzed the two countries with the most shark attacks because we had the most data about these countries. It gives us more details and the significative factors that influence the shark attacks. We reduce the complexity of our model because of the focus on the regions with the most attacks, so it makes our results more relevant.

=======

4. EDA

4.1 Evolution of shark attacks through the years

With the y-axis showing the frequency of shark attacks and the x-axis showing the passage of time, the graph below displays a striking pattern in shark attacks over time. The data exhibits a recognizable pattern that indicates a steady increase in the quantity of shark attacks within the designated period. This rising tendency begs critical concerns regarding what is causing such an increase and necessitates more research into ecological, environmental, and human-related elements that could be involved in this occurrence. The graph emphasizes the need of keeping an eye on and comprehending the dynamics of these episodes in addition to providing a visual depiction of the rise in shark attacks.

This trend is a key research topic: why are marine attacks increasing year after year? What could be the reasons? In fact, the reasons can be various, for example of an ecological, environmental nature or linked to human seaside activity. The graph highlights the need to keep an eye on and understand the dynamics of these episodes, which is exactly what we will do during our study.

Three fundamental areas of this graph should be an element of analysis: 1975-1978; 1988-1990; 2019-2020. In fact, we could ask ourselves, why is there a decrease in attacks in this period? For the last period, the answer is simple: our dataset presents data up to June 2018, for the second half of the year we have no data, which explains the drastic drop in attacks. For the other two periods, we will discover the reasons during a more in-depth analysis

plot1 <- ggplot(data = attacks, aes(x = Year)) +
  geom_smooth(stat = "count", aes(group = 1), color = "blue", size = 1) +
  geom_bar(stat = "count", fill = "red") +
  labs(title = "Shark Attacks Evolution Over Years", x = "Year", y = "Number of Attacks")

plot1

>>>>>>> Stashed changes

4.2 Pattern of shark attacks frequencies

Pattern of shark attacks frequencies through the day

The bar graph below, where the x-axis indicates four time categories —morning, afternoon, evening, and night— displays the distribution of shark attacks throughout the day. The graph’s most noticeable aspect is how frequently shark attacks occur in the afternoon, followed by the morning, evening and night. This concentration of incidents in the afternoon raises intriguing questions about potential human behavioral factors that could contribute to this temporal pattern.

First and foremost, we believe that the afternoon is when most people choose to go swimming, which contributes to the frequency of shark attacks during this time. Recreational water activities such as swimming, surfing, kayaking, and other water sports are probably more popular during these times. Furthermore, the afternoon is when the seas are the warmest, so it’s intriguing to find out if this aspect also has a significant influence on shark activity. Conversely, the decreased number of assaults that occur at night may be related to less people using the water during that time, but there may be other possible reasons as well (less high temperatures, poorer visibility).

Lastly, we suppose that the medium frequencies of the morning and evening might correspond to transitional periods when shark and human behavior are changing.


bar_colors <- c("#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3")

# Create a sorted table
sorted_table <- table(attacks$Time)
sorted_table <- sorted_table[order(-sorted_table)]

# Create a bar plot with the sorted data
barplot1 <- barplot(sorted_table,
                    col = bar_colors,
                    xlab = "Time of Day", ylab = "Frequency of Attacks",
                    main = "Frequency of Shark Attacks by Time of Day",
                    border = "white",
                    ylim = c(0, 1800),
                    space = 0.5,
                    cex.names = 0.8,
                    font.axis = 2,
                    beside = TRUE)

<<<<<<< Updated upstream
model12 <- glm(Fatality ~ Date + Year + Age + Time, 
               data = filtered_data4, 
               family = "binomial")
Regression on the fatality of shark attacks
Dependent variable:
Fatality
Date -0.006
(0.026)
Year -0.030***
(0.006)
Age 0.028***
(0.006)
Time 0.086
(0.124)
Constant 55.500***
(12.300)
Observations 2,319
Log Likelihood -553.000
Akaike Inf. Crit. 1,115.000
Note: p<0.1; p<0.05; p<0.01

We will start by analiyzing the significant coefficient.

The intercept of 55.643 is the estimated log-odds of fatality when all predictors are zero. Since, our variable of interest is a binomal one, it can be difficult to interpret this log-odds directly, though, and it might not be clear how it affects the likelihood of death. This value needs a deepen econometric study in order to be comprehended.

The coefficient for ‘Year’ is -0.030, suggesting that the likelihood of a fatal outcome in shark attacks decreases by around 0.0324.as the year increases. This trend may be caused by developments in emergency medical response and care, public education and awareness campaigns about shark safety, enhanced beach surveillance and warning systems, or adjustments in recreational activities and behavior near coastal areas.

The ‘Age’ coefficient is 0.027, indicating that for each unit increase in age, the log-odds of fatality increase by 0.0269. One explanation for this correlation may be that older people have less physical strength and agility, which makes it harder for them to flee from or defend against a shark attack. Additionally, older people may be more vulnerable to severe injuries from shark bites. Furthermore, the observed higher risk of death in older age groups may be explained by behavioral factors such as an increased propensity to participate in riskier water activities or an increased amount of time spent in the water.

The last two coefficients, that of Date and Time are not significant, but let is dive into them!

The ‘Date’ coefficient is -0.006, but with a p-value of 0.85, it is not statistically significant, implying that the month of the shark attack may not be a strong predictor of fatality. But this was not a surprise. Indeed, the three countries we took into consideration have seasons that vary throughout the year, based on their position on the globe. For example, Summer in USA goes from June to September, but in Australia and South Africa it goes from December to February/March. This might create a disequilibrium when R reads them.

Remarkably, the ‘Time’ coefficient is 0.017, indicating that the time of day may not be a predictive factor for shark attack fatalities. The lack of significance of ‘Time’ may suggest that the time of an attack by a shark has little bearing on whether it ends fatally. Intuitively, we thought that night attacks would be more fatal, given the difficulty in asking or receiving help, but apparently this is not the case.

The model’s goodness of fit is assessed through the null and residual deviances, with a lower residual deviance of 1079 on 2190 degrees of freedom compared to the null deviance of 1120.4 on 2194 degrees of freedom. This reduction in deviance indicates that the model explains some of the variability in the data. The Akaike Information Criterion (AIC) of 1089 is a measure of the model’s relative quality, considering both fit and complexity.

======= # Add text labels with frequencies on the bars text_labels <- sorted_table text(x = barplot1, y = sorted_table, labels = text_labels, pos = 3, cex = 0.8, col = "black")

Pattern of shark attacks frequencies per country

A first glance at our plotted data reveals that the United States, Australia, and South Africa are the three primary countries with a significant number of shark attacks.

shark_attacks_by_country <- attacks %>%
  group_by(Country) %>%
  summarise(total_attacks = n())

shark_attacks_by_country <- shark_attacks_by_country %>%
  mutate(CountryCategory = ifelse(total_attacks >= 30, as.character(Country), "Other"))

# Order the countries by frequency in descending order
shark_attacks_by_country$CountryCategory <- reorder(shark_attacks_by_country$CountryCategory, -shark_attacks_by_country$total_attacks)

# Plot the data
plot2<- ggplot(data = shark_attacks_by_country, aes(x = total_attacks, y = CountryCategory)) +
  geom_col(fill = "skyblue") +
  labs(title = "Total Shark Attacks in Each Country", x = "Number of Attacks", y = "Country") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1)) +
  scale_y_discrete(labels = scales::label_wrap(10))

interactive_plot <- ggplotly(plot2)
interactive_plot
>>>>>>> Stashed changes

4.3 Fatality of shark attacks

Are shark attacks more likely to lead to death?

IT WAS A PIECHART…

Fatalities by type of shark

As we can see in this graph, most of attacks were not fatal, meaning that sharks injured a lot of people without taking their lives away. The types of sharks that did the most of accidents were the following ones: Bull shark, Tiger shark and White shark, with the deadlies being the last one. Why do they commit attacks? Probably because their feeding behavior. Maybe they misidentify their predators attacking people instead of animals like turtles, seals, or sea lions. This is consistent with Bull shark and Tiger shark being repetitevly named as the deadlies sharks in the seas and oeceans (CBS News, 2010). Another relevant information derived from this graph is the big number of unidentified sharks. Out of all the attacks recorded by our dataset, we have almost 600 Unidentified species. It is such a pity not to be able to have this information, which would have led us to more insightful results.